A study of the reliability of the proficiency rating scale and
techniques used by three federal government agencies (the Central
Intelligence Agency, the Defense Language Institute, and the Foreign
Service Institute [FSI]) to test employees' oral language proficiency
in French and German had two randomly selected two-person teams of
testers from each agency test 20 subjects for each language.
The ratings assigned to the subjects were compared with an expected
rating distribution. Results indicated no statistical difference in
the ratings across agencies, either for the combined languages or for
each language separately. However, ratings in various sub-portions of
the proficiency scale showed clear across-agency differences,
generally reflecting relatively higher ratings on the part of FSI
raters and, occasionally, wide discrepancies in scoring for
individual examinees. Findings also indicated that, despite the
feeling that their language proficiency had been adequately probed, a
significant number of examinees felt that the FSI procedure was more
anxiety-producing than the others and that it used some "unfair"
eliciting techniques. It is concluded that, although the study
examines only the test-retest comparability of the interviewing
process across agencies, there is a need for further research if the
procedure is to be used as an inter-agency measure. (MSE)



A Study of the Comparability
of Speaking Proficiency Interview Ratings
Across Three Government Language Training Agencies

John L. D. Clark
Center for Applied Linguistics
1118 22nd Street, N.W.
Washington, DC 20037

January 1986

A Study of the Comparability of Speaking Proficiency Interview Ratings 
Across Three Government Language Training Agencies 



John L. D. Clark
Center for Applied Linguistics 1



BACKGROUND 

A pervasive question in the operational use and interpretation of the
results of speaking proficiency interviews based on the "ILR" (Interagency
Language Roundtable) proficiency level descriptions is the extent to which
given examinees' performances would be evaluated in a similar manner across
the variety of government agencies and other institutions that make use of
this testing procedure. Although there has been a fair amount of conjecture
and internal discussion of this topic on the part of examiners and
administrators involved in the day-to-day implementation of agency testing
programs, there has not until recently been an opportunity to address the
comparability-of-rating question in a straightforward empirical manner.

The following is a description of the procedures and major results of a
direct experimental comparison of the proficiency ratings assigned to a common
group of examinees by testers in each of three government language training
agencies (CIA, DLI, and FSI) for each of two languages (French and German).
Also discussed are the extent to which the results of this particular study
might legitimately be extrapolated, cautions on areas in which extrapolation
would not be appropriate, and recommendations for follow-up investigation of
other aspects of reliability and validity of the interview testing process not
formally addressed in the present study.



1 The assistance of a number of other persons in the conduct of this study is
gratefully acknowledged. Among the CAL staff, Lynn E. Thompson provided
very effective administrative assistance during all phases of the project.
Christina Garbacz had major responsibility for data entry as well as for
various aspects of statistical processing, and Rebecca Oxford contributed
substantially to project planning and procedures specification. Nina Levinson
(CIA), Thea Bruhn (FSI), and Ellen Mitchell and Phillip White (DLI) coordinated
the interviewing activities at their respective agencies with a high level of
diligence and effectiveness, and a debt of appreciation is owed the many
interviewees in the study who provided, on a voluntary basis, the time and
personal interest needed to participate willingly in the interviewing process
on three separate occasions. Finally, the major expression of appreciation
must be reserved for the certified testers at each of the three agencies, who
maintained throughout the six days of testing a seriousness of purpose and
diligence of approach to their interviewing and rating tasks that fully
demonstrate their high level of professionalism and competence in the
proficiency testing role.
PROCEDURE

Overall study design. The basic experimental design for the study
involved a "test-retest" procedure, in which each examinee was sequentially
interviewed by a separate testing team from each of the three participating
agencies. In conducting its own interviews, each team made use of the
particular interviewing techniques and procedures for arriving at a final
rating that were currently in use at that agency. On completion of the
process, the team reported a single overall rating on the numerical proficiency
scale and associated verbal descriptions of performance endorsed by the ILR
member agencies in November 1981 as a "common metric" for speaking proficiency
assessment and reporting. This scale comprises six major ratings (0, 1, 2, 3,
4, and 5) supplemented by five intermediate ("plus") ratings (0+, 1+, 2+, 3+,
and 4+). The scale is intended to characterize the full range of possible
learner proficiency levels, from no functional proficiency in the language
(level 0) to proficiency indistinguishable in all respects from that of an
educated native speaker (level 5).
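
The eleven-point scale just described can be sketched as a small lookup. The two-digit numeric codes below follow the legend printed with the score tables later in this report ("00 = 0, 07 = 0+, 10 = 1, 17 = 1+, etc."); extending that pattern to the remaining levels is an assumption, and the function name is hypothetical rather than anything used in the study.

```python
# Hypothetical helper: convert an ILR rating label to a two-digit numeric code.
# Assumed pattern, extrapolated from the report's table legend
# (00 = 0, 07 = 0+, 10 = 1, 17 = 1+, etc.): base level * 10, plus 7 for "plus".
ILR_LABELS = ["0", "0+", "1", "1+", "2", "2+", "3", "3+", "4", "4+", "5"]


def ilr_to_code(label):
    """Return the assumed numeric code for an ILR rating label."""
    if label not in ILR_LABELS:
        raise ValueError(f"not an ILR rating: {label!r}")
    base = int(label[0]) * 10
    return base + 7 if label.endswith("+") else base


# Full mapping for the six major and five intermediate ratings.
codes = {label: ilr_to_code(label) for label in ILR_LABELS}
```

A numeric coding of this kind is what allows mean scores and analysis of variance to be computed on ratings that are otherwise ordinal labels.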

Within the administrative and financial constraints involved, it was
obviously not possible to carry out such a study for each of the numerous
languages in which the agencies routinely test, nor, within a given language,
to involve each and every one of the examiners/testers currently conducting
interviews within that language. With regard to the selection of languages
for the study, discussion with the testing coordinators at each of the three
agencies, as well as with the ILR testing subcommittee, resulted in the
identification of French and German as two languages for which an adequate
number of examinees and testers for the study could be made available within
each of the participating agencies, and for which the annual testing volume
was sufficiently high to warrant priority attention from an administrative
standpoint. With respect to the number of tester teams involved, staff time
and travel cost considerations dictated a maximum of two teams per language,
for each of the three agencies, i.e., the following configuration:



CIA French Team 1  (all teams are two-person)
CIA French Team 2

DLI French Team 1
DLI French Team 2

FSI French Team 1
FSI French Team 2



CIA German Team 1
CIA German Team 2

DLI German Team 1
DLI German Team 2

FSI German Team 1
FSI German Team 2



Selection of testers and examinees. In order to enhance the likelihood
that, for each agency and language, the testers actually selected for the
study would be representative of the total group of individuals operationally
testing in that agency/language, the testing coordinator at the agency was
asked to provide a complete list of qualified, currently active testers in
each language. From this list, the CAL project staff selected all study
participants on a statistically random basis. It is thus considered that the
composition of the tester groups from each of the three agencies constituted a
rigorous random sampling of the population of testers in that language who had
been identified by the agency as properly qualified and actively testing within
that agency.
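
The random selection of testers from an agency's roster can be illustrated with a short sketch. The report does not say how the draw was made, so the function, the roster names, and the seed below are all hypothetical; the sketch only shows one way to draw two two-person teams without replacement from a list of active testers.

```python
import random


def select_teams(active_testers, n_teams=2, team_size=2, seed=None):
    """Draw n_teams teams of team_size testers at random, without replacement."""
    rng = random.Random(seed)
    chosen = rng.sample(active_testers, n_teams * team_size)
    return [chosen[i * team_size:(i + 1) * team_size] for i in range(n_teams)]


# Hypothetical roster for one agency and language.
roster = ["tester_A", "tester_B", "tester_C", "tester_D",
          "tester_E", "tester_F", "tester_G"]
teams = select_teams(roster, seed=1986)
```

Sampling without replacement guarantees that no tester is assigned to both teams.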

A second important design consideration was the selection of examinees.
It was considered highly desirable by the project staff, as well as by the ILR
testing committee, to investigate rating performance across the full range of
proficiency levels by including in the examinee pool individuals covering the
gamut from the lowest measurable level (0+) to the functional equivalent of an
educated native speaker (5). At the same time, in view of the fact that the
bulk of operational testing at each of the agencies is concentrated within a
somewhat smaller band (roughly 1 to 3/3+ for CIA and DLI, 2 to 4 for FSI), it
was considered important to ensure that a reasonably large number of examinees
within this "higher-volume" range would be included in the study sample. To
help provide a distribution of examinees for the study that would satisfy both
of these criteria, the testing coordinators at each agency were asked to locate
and arrange for the participation, per agency, of 20 examinees in each
language, and to select these individuals, on the basis of coordinator or
language instructor judgments about their proficiency and/or recent interview
scores in the agency files, so as to reflect as closely as possible the
distribution of proficiency levels shown in Figure 1.

The coordinators were asked to employ, to the extent possible, a
stratified random sampling procedure (for which detailed instructions were
given) in identifying the particular examinees who would be asked to
participate. The total pool from which the examinees at a given agency were
to be drawn was defined to include, in addition to currently enrolled
students, other categories of individuals that the agency would typically have
the occasion to test in the course of its ongoing testing activities (for
example, instructor applicants at DLI, career officers at FSI). Due to a
variety of factors, including scheduling conflicts on the part of potential
examinees, the necessarily voluntary nature of participation, and the
need to locate substitute interviewees on several occasions during the course
of the testing, it was not possible to rigorously implement a statistically
random process of examinee selection. However, since the major intent in
selecting examinees was simply to provide an appropriate overall distribution
of proficiency levels across examinees at each agency, departure from strict
random selection of the examinee group was not considered a significant
procedural drawback nor an impediment to the proper interpretation of the
tester-specific information on which the study was primarily focused.

Scheduling of interviews. Interviews were conducted on a sequential
basis, with two days of testing taking place at each agency. Testing dates
were: FSI - September 9-10, 1985; CIA - September 11-12; DLI - September
17-18. On each of these dates, the "home" agency made available all necessary
interviewing rooms and other facilities and was responsible for scheduling and
contacting the examinees to be tested at that agency by all three tester
groups.
Figure 1

Intended Distribution of Examinee Proficiency Levels at Each Agency

Level     No. of Examinees

 0+              1
 1               1
 1+              2
 2               3
 2+              3
 3               3
 3+              3
 4               2
 4+              1
 5               1

Total           20

Project staff forwarded the testing coordinator at each agency a detailed
schedule (see Figure 2) for allocating examinees to testing teams in such a way
as to counterbalance the agency order in which the examinees would be tested
as well as to statistically randomize other (uncontrolled) effects
attributable to examinees. The test administration schedule, which was
followed with only very minimal exceptions at FSI and CIA, involved the
administration of all three interviews to a given examinee within a single day.
For example, as shown in Figure 2, "examinee 1" was interviewed by a CIA tester
team during the first one-hour time period of Day 1, by a DLI team during the
third time period, and by an FSI team during the fifth period. Examinee rest
breaks of at least one hour were provided between interviews, as well as a
one-hour lunch break between either the first and second or second and third
interviews. In addition to the lunch period, each tester team had a further
one-hour break at some point in the testing day.
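
The counterbalancing of agency order can be sketched as a simple rotation. For examinees 1 through 6, Figure 2 places the first interview with CIA (examinees 1-2), FSI (examinees 3-4), and DLI (examinees 5-6), with the other agencies following in rotation. The function below reproduces that pattern; treating it as a rule that extends to all 20 examinees is an assumption, and the function name is hypothetical.

```python
from collections import Counter

AGENCIES = ["CIA", "DLI", "FSI"]


def agency_order(examinee_id, block_size=2):
    """Return the assumed order in which agencies interview an examinee.

    Examinees are grouped in blocks of block_size; each successive block
    rotates the base order one step to the right (a Latin-square rotation).
    """
    block = (examinee_id - 1) // block_size
    shift = block % len(AGENCIES)
    split = len(AGENCIES) - shift
    return AGENCIES[split:] + AGENCIES[:split]


# Check the balance over the first six examinees: every agency should occupy
# each interview position (first, second, third) equally often.
position_counts = Counter(
    (position, agency)
    for examinee_id in range(1, 7)
    for position, agency in enumerate(agency_order(examinee_id))
)
```

With three agencies and blocks of two examinees, each agency appears twice in each position across examinees 1 through 6, so order effects such as examinee fatigue or practice are spread evenly over the agencies.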

In setting up the above testing schedule, it was understood and
acknowledged that the per-day "interviewing load" on the part of the testers
(six interviews on one day, four on the other) was in some cases more
substantial than was typically the case in ongoing testing work at the agency.
However, counterbalancing considerations of increased staff costs, additional
travel/subsistence expenses, and potential inconvenience on the part of
examinees who would be required to appear again on a second or even third
testing day, dictated adoption of the indicated strategy. In a debriefing
questionnaire completed at the end of the testing sessions, several examiners
reported that they felt somewhat burdened by the overall quantity of interviews
required over the available time span, but also for the most part noted that
they considered their interviews and associated ratings given in the course of
the study to be as thorough and as accurate as those carried out in regular
agency testing.

At DLI, due to restrictions imposed on the scheduling arrangements by
both the overall daily schedule at the agency and by individual examinees'
classroom session assignments, it was necessary to adopt a somewhat modified
procedure in which, for a given examinee, the three interviews were held over
a two-day period, on either a 2-1 or 1-2 basis. This modification also
resulted in a slightly easier and more uniform interviewing pace on the part of
the testers, who, with very few exceptions arising from the occasional need to
"catch up" for a student who had failed to appear at an assigned testing time,
were required to test only 5 students on each of the two days.

Interviewing procedures. All tester teams were extensively advised,
both in memoranda circulated prior to the testing and verbally at the beginning
of the first testing day, to carry out each interview in strict conformance
with the procedures currently in effect at the testers' agency, including, as
appropriate, the use of any routine auxiliary materials (e.g., cue cards
describing situations that the student is asked to deal with, background
reading materials associated with the FSI "briefing" task, and so forth). In
addition, the testers were to follow whatever procedures they normally used in
arriving at a final interview rating, including, for example, jointly discussing
the interviewee's performance; reviewing the verbal proficiency descriptions;
and considering (and, if it was the operational procedure at the agency,
rating) the speech sample with respect to specified sub-factors of
performance. Each testing team was also asked to report the final global
rating, as well as any factor scores or other routine annotations/feedback
information, on the printed forms in use at their agency for this purpose. If
separate forms were normally completed by each tester, both were to be
submitted; if there was any disagreement concerning the final rating, the
testers were to resolve the issue among themselves and circle or otherwise
indicate the official "final" rating on one or the other of these forms.

Figure 2

Interviewing Schedule

(Cell entries are examinee IDs; same sequence used for French and German)

Time slots: A B C D E F (Day One); G H I J K L (Day Two)

Team            Day One               Day Two

CIA Team 1:     1  2  3  4  5  6      11 12 13 14
CIA Team 2:     7  8  9 10            15 16 17 18 19 20
DLI Team 1:     5  6  1  2  3  4      11 12 13 14
DLI Team 2:     7  8  9 10            19 20 15 16 17 18
FSI Team 1:     3  4  5  6  1  2      13 14 11 12
FSI Team 2:     9 10  7  8            17 18 19 20 15 16

In the course of the interviewing process at all three agencies, the
author and another professional project staff member separately sat in on a
total of approximately twelve interview sessions, distributed fairly randomly
across languages, agencies, and interviewer teams. All interviews conducted by
the tester teams, whether or not they were also observed by project staff, were
audio recorded on C-90 cassettes, using tape recorders with built-in
microphones, with the recorders placed on a table between the examinee and
testers. In most instances, the raters' post-interview discussion of the
examinee's performance was also recorded. Spot-checking of a number of
completed tapes indicated that the spoken material was in general clearly
audible with respect to both the examinee and interviewers. The obtained total
of over 300 interview recordings is considered to provide a valuable corpus for
further linguistic analysis or other follow-up study.

Across all three agencies, 115 examinees were interviewed by testers from
each of the three agencies, out of a design total of 120. This very high level
of participation is due to both the diligence of the testing coordinators in
making the initial administrative arrangements for the interviewing and their
willingness and ability to readily locate appropriate substitute interviewees
as the occasion required over the course of the six testing days.



RESULTS 

Overall results. Two types of analysis, chi-square and analysis of
variance, were conducted for the testing results as a whole, that is, for the
scoring performance of testers across both language groups combined. Table 1
shows the observed and expected frequencies of ratings from 0/0+ (these two
levels combined to provide adequate cell size) to 5 on the part of the CIA,
DLI, and FSI rating teams. The overall chi square of 20.3, with a chance
probability of .32, fails to demonstrate a statistically significant
difference across agencies with respect to the rating of examinee performance
on a global (combined languages) basis. Alternatively stated, this statistical
test indicated an approximately 1 in 3 chance that the observed differences
across agencies in interview scores assigned to given examinees were due simply
to random statistical effects rather than to agency-specific differences in
rating tendencies. It is customary not to consider differences between or
among groups to be "significant" unless there is a less than 1 in 20 chance
probability (usually abbreviated as p < .05) that the observed results are due
to random variation alone. As shown in Table 5, nonsignificant
results (F = 2.27; p = 0.10) for combined French and German interviews were
also obtained for a between- and within-groups analysis of variance, a
statistical procedure that also serves to determine the likelihood that the
observed results are a consequence of random variation rather than true
inter-group differences.
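
The overall chi-square figure can be checked with a short calculation. The sketch below is not the software used in the study; it recomputes the statistic from the observed frequencies in Table 1, using only the Python standard library. The 4+ row (8, 6, 9) is not legible in the source and is inferred here from the printed expected values and deviations, so it should be treated as an assumption. The closed-form tail probability is valid only for an even number of degrees of freedom, which holds here (18).

```python
import math

# Observed ratings per level as (CIA, DLI, FSI), from Table 1.
# The "4+" row is inferred from expected value 7.7 plus the printed deviations.
OBSERVED = {
    "0/0+": (8, 9, 3),
    "1": (19, 15, 10),
    "1+": (16, 17, 9),
    "2": (12, 13, 21),
    "2+": (10, 19, 15),
    "3": (10, 11, 18),
    "3+": (16, 11, 11),
    "4": (10, 9, 13),
    "4+": (8, 6, 9),
    "5": (6, 5, 6),
}


def chi_square(table):
    """Pearson chi-square statistic and degrees of freedom for a two-way table."""
    rows = list(table.values())
    col_totals = [sum(col) for col in zip(*rows)]
    grand_total = sum(col_totals)
    stat = 0.0
    for row in rows:
        row_total = sum(row)
        for observed, col_total in zip(row, col_totals):
            expected = row_total * col_total / grand_total
            stat += (observed - expected) ** 2 / expected
    dof = (len(rows) - 1) * (len(col_totals) - 1)
    return stat, dof


def chi_square_upper_tail(x, dof):
    """P(X > x) for a chi-square variable; closed form for even dof only."""
    if dof % 2 != 0:
        raise ValueError("closed form requires an even number of degrees of freedom")
    half = x / 2.0
    term = 1.0
    total = 1.0
    for i in range(1, dof // 2):
        term *= half / i
        total += term
    return math.exp(-half) * total


chi2_stat, dof = chi_square(OBSERVED)
p_value = chi_square_upper_tail(chi2_stat, dof)
```

Because each agency rated the same 115 examinees, the expected count in every cell is simply one third of the row total, which is why the three expected values in each row of Table 1 are identical.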

Chi-square analyses were also conducted separately for the French (Table
2) and German (Table 3) data. Nonsignificant differences were again found
for both languages, with a quite high chance probability for French (.71) and
a lower, but still nonsignificant probability (.10) for German. These results
may be interpreted as indicating a 7 in 10 likelihood that the observed rating
differences among the three agencies with respect to the French testing were

Table 1

Chi-Square by Agency and Interview Score Assigned
(French and German)

Level    Statistic        CIA     DLI     FSI    Row Totals

0, 0+    Observed           8       9       3       20
         Expected         6.7     6.7     6.7
         (O - E)          1.3     2.3    -3.7
         Contribution     0.3     0.8     2.0      3.1

1        Observed          19      15      10       44
         Expected        14.7    14.7    14.7
         (O - E)          4.3     0.3    -4.7
         Contribution     1.3     0.0     1.5      2.8

1+       Observed          16      17       9       42
         Expected        14.0    14.0    14.0
         (O - E)          2.0     3.0    -5.0
         Contribution     0.3     0.6     1.8      2.7

2        Observed          12      13      21       46
         Expected        15.3    15.3    15.3
         (O - E)         -3.3    -2.3     5.7
         Contribution     0.7     0.4     2.1      3.2

2+       Observed          10      19      15       44
         Expected        14.7    14.7    14.7
         (O - E)         -4.7     4.3     0.3
         Contribution     1.5     1.3     0.0      2.8

3        Observed          10      11      18       39
         Expected        13.0    13.0    13.0
         (O - E)         -3.0    -2.0     5.0
         Contribution     0.7     0.3     1.9      2.9

3+       Observed          16      11      11       38
         Expected        12.7    12.7    12.7
         (O - E)          3.3    -1.7    -1.7
         Contribution     0.9     0.2     0.2      1.3

4        Observed          10       9      13       32
         Expected        10.7    10.7    10.7
         (O - E)         -0.7    -1.7     2.3
         Contribution     0.0     0.3     0.5      0.8

4+       Observed           8       6       9       23
         Expected         7.7     7.7     7.7
         (O - E)          0.3    -1.7     1.3
         Contribution     0.0     0.4     0.2      0.6

5        Observed           6       5       6       17
         Expected         5.7     5.7     5.7
         (O - E)          0.3    -0.7     0.3
         Contribution     0.0     0.1     0.0      0.1

Column   Observed         115     115     115      345
Totals   Contribution     5.7     4.3    10.3     20.3

No. of observations = 345     Degrees of freedom = 18
Chi square = 20.3             Chance probability = 0.32

Table 2

Chi-Square for French Interviews

Level      Statistic        CIA     DLI     FSI    Row Totals

0, 0+, 1   Observed          13      13       5       31
           Expected        10.3    10.3    10.3
           (O - E)          2.7     2.7    -5.3
           Contribution     0.7     0.7     2.8      4.1

1+         Observed          10       9       5       24
           Expected         8.0     8.0     8.0
           (O - E)          2.0     1.0    -3.0
           Contribution     0.5     0.1     1.1      1.8

2          Observed           8       7      12       27
           Expected         9.0     9.0     9.0
           (O - E)         -1.0    -2.0     3.0
           Contribution     0.1     0.4     1.0      1.6

2+         Observed           7       7       7       21
           Expected         7.0     7.0     7.0
           (O - E)          0.0     0.0     0.0
           Contribution     0.0     0.0     0.0      0.0

3          Observed           4       7       8       19
           Expected         6.3     6.3     6.3
           (O - E)         -2.3     0.7     1.7
           Contribution     0.9     0.1     0.4      1.4

3+         Observed           6       5       8       19
           Expected         6.3     6.3     6.3
           (O - E)         -0.3    -1.3     1.7
           Contribution     0.0     0.3     0.4      0.7

4          Observed           5       7       8       20
           Expected         6.7     6.7     6.7
           (O - E)         -1.7     0.3     1.3
           Contribution     0.4     0.0     0.3      0.7

4+, 5      Observed           8       6       8       22
           Expected         7.3     7.3     7.3
           (O - E)          0.7    -1.3     0.7
           Contribution     0.1     0.2     0.1      0.4

Column     Observed          61      61      61      183
Totals     Contribution     2.7     1.9     6.0     10.6

No. of observations = 183     Degrees of freedom = 14
Chi square = 10.6             Chance probability = 0.71

Table 3

Chi-Square for German Interviews

Level      Statistic        CIA     DLI     FSI    Row Totals

0, 0+, 1   Observed          14      11       8       33
           Expected        11.0    11.0    11.0
           (O - E)          3.0     0.0    -3.0
           Contribution     0.8     0.0     0.8      1.6

1+         Observed           6       8       4       18
           Expected         6.0     6.0     6.0
           (O - E)          0.0     2.0    -2.0
           Contribution     0.0     0.7     0.7      1.3

2          Observed           4       6       9       19
           Expected         6.3     6.3     6.3
           (O - E)         -2.3    -0.3     2.7
           Contribution     0.9     0.0     1.1      2.0

2+         Observed           3      12       8       23
           Expected         7.7     7.7     7.7
           (O - E)         -4.7     4.3     0.3
           Contribution     2.8     2.4     0.0      5.3

3          Observed           6       4      10       20
           Expected         6.7     6.7     6.7
           (O - E)         -0.7    -2.7     3.3
           Contribution     0.1     1.1     1.7      2.8

3+         Observed          10       6       3       19
           Expected         6.3     6.3     6.3
           (O - E)          3.7    -0.3    -3.3
           Contribution     2.1     0.0     1.8      3.9

4, 4+, 5   Observed          11       7      12       30
           Expected        10.0    10.0    10.0
           (O - E)          1.0    -3.0     2.0
           Contribution     0.1     0.9     0.4      1.4

Column     Observed          54      54      54      162
Totals     Contribution     6.8     5.1     6.4     18.4

No. of observations = 162     Degrees of freedom = 12
Chi square = 18.4             Probability of chance = 0.10



Table 4

Chi-Square for Agency Pairs

                       X2       N      df       p

French and German
  CIA - DLI            4.6     230      9      0.85
  CIA - FSI           14.1     230      9      0.12
  DLI - FSI           11.9     230      9      0.22

French
  CIA - DLI            1.6     122      7      0.98
  CIA - FSI            8.3     122      7      0.30
  DLI - FSI            7.1     122      7      0.42

German
  CIA - DLI            8.7     108      6      0.19
  CIA - FSI           11.0     108      6
  DLI - FSI            8.1     108      6

Table 5

Analysis of Variance and t-Test Comparisons
for Interview Scores across Three Agencies

Source of Variation     df     Sum of Squares     Mean Square      F       p

French and German
  Between groups         2          722.649          361.325     2.27    0.10
  Within groups        342        54477.078          159.290
  Total                344        55199.728

French
  Between groups         2          648.995          324.497     2.00    0.14
  Within groups        180        29232.361          162.402
  Total                182        29881.355

German
  Between groups         2          192.704           96.352     0.61    0.55
  Within groups        159        25098.241          157.851
  Total                161        25290.944


t-Statistics

Comparison               t        p

French and German
  CIA - DLI             .517     0.64
  CIA - FSI            1.531     0.22
  DLI - FSI            2.048     0.13

French
  CIA - DLI             .078     0.94
  CIA - FSI            1.691     0.19
  DLI - FSI            1.769     0.18

German
  CIA - DLI             .674     0.55
  CIA - FSI             .421     0.70
  DLI - FSI            1.095     0.36

purely attributable to chance factors. Although there is a smaller probability
(1 in 10) that the German differences were also simply a result of "chance,"
this figure still does not reach the commonly accepted 1 in 20 criterion for a
statistically significant difference. Analysis of variance for French and
German groups considered separately (Table 5) also shows nonsignificant rating
differences for both languages across the three participating agencies.
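
The F ratios reported in Table 5 follow directly from the sums of squares and degrees of freedom printed there. The sketch below recomputes them; it is an illustration, not the procedure used in the study. The tail-probability function relies on a closed form that holds only when the between-groups degrees of freedom equal 2, as they do for a three-agency comparison. Function and variable names are hypothetical.

```python
# (between-groups SS, between df, within-groups SS, within df), from Table 5.
ANOVA_INPUTS = {
    "French and German": (722.649, 2, 54477.078, 342),
    "French": (648.995, 2, 29232.361, 180),
    "German": (192.704, 2, 25098.241, 159),
}


def f_ratio(ss_between, df_between, ss_within, df_within):
    """F = mean square between groups / mean square within groups."""
    return (ss_between / df_between) / (ss_within / df_within)


def f_upper_tail_df1_equals_2(f_value, df_within):
    """P(F > f_value) for an F distribution with (2, df_within) degrees of freedom."""
    return (1.0 + 2.0 * f_value / df_within) ** (-df_within / 2.0)


results = {}
for name, (ss_b, df_b, ss_w, df_w) in ANOVA_INPUTS.items():
    f_value = f_ratio(ss_b, df_b, ss_w, df_w)
    results[name] = (f_value, f_upper_tail_df1_equals_2(f_value, df_w))
```

In each case the between-groups mean square is of the same order as the within-groups mean square, which is why none of the F ratios approaches the value needed for significance at the .05 level.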

Additional chi-square analyses comparing the rating performance of
individual pairs of agencies (CIA-DLI, CIA-FSI, and DLI-FSI) are shown in Table
4 for both whole-group and separate-language comparisons. All of these are
statistically nonsignificant (p > .05). As shown in Table 5, similar results
are found for t-tests of agency pairs (an analysis of variance-type procedure
applicable to comparisons of pairs of groups), none of which comparisons reach
statistical significance at the .05 level.

In summary of the overall analyses, it may be concluded that the ratings
assigned during this study by CIA, DLI, and FSI tester teams, when considered
across all the examinee proficiency levels taken as a whole, do not differ
among the three agencies or between any pair of agencies in a statistically
significant manner, either in combined (French and German) comparisons or in
comparisons separately by language.



Inter-agency patterns of score distribution. Although the whole-group
comparisons of scoring performance across the three agencies do not reach
statistical significance, examination of the particular scores assigned to
examinees within various portions of the overall proficiency range reveals
some very interesting patterning. Table 6 shows the interview scores assigned
to each examinee by the CIA, DLI, and FSI testers, listed in order of
increasing mean score across the three agencies and including both French and
German groups. For any given examinee, an asterisk in one of the columns
indicates that that particular score is higher than the scores given by both of
the other agencies. Of the 115 examinees interviewed, the CIA testers
assigned, in 16 instances, a higher score than the other two tester teams. The
DLI testers assigned higher ratings than their inter-agency colleagues on 8
occasions, and the FSI testers assigned higher scores in 43 cases. A fairly
clear pattern is evident in the level 1, 1+, 2 range, with the FSI testers
tending in many instances to assign a 1+ (or in a few instances, a 2) to
examinees rated as level 1 by CIA and DLI testers. A similar tendency is noted
a half-step higher on the scale, with FSI testers assigning level 2 to a number
of examinees rated as 1 or 1+ by the other two agencies. A less marked
tendency to assign 2+ vis-a-vis 1+ or 2 is also noted.
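
The asterisk counts described above (16 for CIA, 8 for DLI, 43 for FSI) rest on one rule: an agency is credited only when its score is strictly higher than both of the others. The sketch below applies that rule to a few invented score triples; the sample data and the function name are hypothetical and are not drawn from Table 6.

```python
from collections import Counter


def count_strictly_highest(score_rows):
    """Count, per agency, the examinees for whom that agency alone gave the top score."""
    wins = Counter()
    for row in score_rows:
        top = max(row.values())
        leaders = [agency for agency, score in row.items() if score == top]
        if len(leaders) == 1:  # ties for the top score earn no asterisk
            wins[leaders[0]] += 1
    return wins


# Invented examples using the two-digit coding (10 = 1, 17 = 1+, 20 = 2, ...).
sample_rows = [
    {"CIA": 10, "DLI": 10, "FSI": 17},  # FSI alone is highest
    {"CIA": 17, "DLI": 10, "FSI": 20},  # FSI alone is highest
    {"CIA": 30, "DLI": 27, "FSI": 27},  # CIA alone is highest
    {"CIA": 20, "DLI": 20, "FSI": 20},  # complete agreement, no asterisk
    {"CIA": 27, "DLI": 27, "FSI": 20},  # two-way tie at the top, no asterisk
]
wins = count_strictly_highest(sample_rows)
```

Under this rule the three counts need not sum to the number of examinees: 16 + 8 + 43 = 67, so for the remaining 48 of the 115 examinees no single agency's score stood above both others.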

A tendency on the part of the FSI raters to assign level 3 scores to
examinees rated lower than level 3 by the other two agencies is not evident in
the data. While there are 6 such instances in the combined French and German
data, there are 5 cases in which the CIA testers assigned 3 or 3+ to examinees
rated as 2+ or lower by both the DLI and FSI teams. Beyond level 3, the
distribution of assigned scores across the three agencies shows generally
random differences, with no discernible agency-specific patterning.

Table 7 shows the distribution of ratings across agencies for the French
testers separately. The tendency toward relatively higher ratings on the part
of the FSI French raters is even more marked than for the combined language
group, with higher-than-the-other-two-agency scores assigned by FSI to 30 of
the 61 French examinees. Ratings of the CIA French testers were higher than
those of their inter-agency colleagues in only 7 instances, and the ratings
assigned by the DLI testers were virtually never higher (2 of 61 occasions)
than both of the other agency teams. Rating patterns again show the higher
FSI ratings to be most frequent at the 1, 1+, 2, and 2+ levels. However, in
the case of French, there also appears to be some tendency toward the awarding
of level 3 scores by FSI to examinees rated as 2+ or lower by the other two
agencies (5 instances on the part of the FSI testers, with only one comparable
rating by CIA and none by DLI). The inter-agency differences in mean interview
scores for French, while not statistically significant, do show a clearly
higher numerical value for FSI (29.5) than for CIA and DLI (25.6 and 25.5,
respectively).

Table 6

Examinee Score Levels Assigned, by Agency
(French and German)

Legend: 00 = 0, 07 = 0+, 10 = 1, 17 = 1+, etc.

Asterisks indicate a score higher than that of the other two agencies.

           CIA     DLI     FSI
Mean:     26.0    25.2    28.5
S.D.:     13.4    12.5    11.8
N:         115     115     115

Table 7

Examinee Score Levels Assigned, by Agency
(French)

Legend: 00 = 0, 07 = 0+, 10 = 1, 17 = 1+, etc.

Asterisks indicate a score higher than that of the other two agencies.

Table 8 shows the distribution of German ratings on an across-agencies 
basis. By contrast to the French data, an apparent tendency to half-point 
higher ratings on the part of the FSI testers is principally restricted to 1+ 
vs. 1 and 2 vs. 1+ comparisons, and is by no means as salient or as widespread 
across proficiency levels as is the case for the French group. Also noteworthy 
in the German ratings is a tendency to higher ratings on the part of the CIA 
testers in the middle level of the score range, with level 3 or 3+ assigned by 
CIA to four examinees rated as 2+ or lower by both DLI and FSI, and 3+ awarded 
to three other examinees who were considered to be no higher than level 3 by 
the other two agencies. Mean German interview scores (26.5, 24.9, and 27.5 for 
CIA, DLI, and FSI, respectively) did not differ significantly across agencies. 

Across-agency differences in scoring patterns may also be examined by 
means of expectancy tables based on the frequencies with which raters from 
pairs of agencies assigned particular level scores to given examinees. Table 9 
shows, for each of the levels assigned by the CIA French testers, the 
corresponding level assignments of the DLI testers. For example, for the total 
of 9 interviewees who were rated as level 1 by CIA, 56 percent of these 
examinees were also rated as level 1 by DLI, 33 percent were rated as level 0+, 
and 11 percent, as level 2. For the 10 examinees rated as 1+ by CIA, the DLI 
ratings were split at 40 percent each for levels 1+ and 2, and 10 percent each 
for levels 1 and 2+. The discrepancies are more marked for the comparison of 
CIA and FSI ratings in French (Table 10), which shows, for example, that 
examinees considered to be at level 1 by the CIA testers were in a majority of 
cases rated as 1+ (56 percent) by the FSI testers and in a third of the cases, 
as level 2. At this level, 89 percent of the "level 1" examinees by CIA 
standards were rated as level 1+ or higher by the FSI testers. The tendency 
continues through "CIA levels" 1+, 2, 2+, 3, and 3+, with the majority of FSI 
ratings being at least a half-level higher in all five comparisons. With the 
exception of "DLI 2+," comparisons of DLI and FSI French scores (Table 12) 
reveal an essentially similar pattern across DLI levels 0 through 3, with the 
bulk of the FSI scores consistently a half-level or more higher than the scores 
assigned to the same examinees by the DLI testers. 
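
The construction of an expectancy table of this kind can be sketched in 
Python as follows. The paired ratings shown are hypothetical values chosen 
only for illustration (they are not the study's data), and the function name 
`expectancy_table` is an assumption of this sketch. For each level assigned by 
one agency, the code reports the percentage of those examinees who received 
each level from the second agency. 

```python
from collections import Counter, defaultdict

def expectancy_table(pairs):
    """Return {row_level: {col_level: percent}} from (row, col) rating pairs."""
    counts = defaultdict(Counter)
    for row_level, col_level in pairs:
        counts[row_level][col_level] += 1
    table = {}
    for row_level, row_counts in counts.items():
        n = sum(row_counts.values())  # examinees given this level by the row agency
        table[row_level] = {
            col_level: round(100 * k / n) for col_level, k in row_counts.items()
        }
    return table

# Hypothetical (CIA rating, DLI rating) pairs; NOT the study's data.
pairs = [("1", "1"), ("1", "1"), ("1", "0+"), ("1", "2"),
         ("1+", "1+"), ("1+", "2")]
table = expectancy_table(pairs)
for level, row in table.items():
    print(level, row)
```

Because each row is normalized by its own count, rows based on only a few 
examinees can show large percentage swings from a single rating. 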

For German, there is no consistent pattern of higher or lower ratings 
between the CIA and DLI raters from levels 0+ through 2+ (Table 15), but at 
"CIA levels" 3 and 3+, the DLI raters were seen to assign somewhat lower 
ratings on the whole, with an appreciable spread at CIA 3+, where 50 percent of 
the corresponding DLI ratings were a full level lower and 20 percent, a level 
and a half lower. For CIA-FSI comparisons in German (Table 16), there is a 
clear pattern of at least half-point higher FSI ratings at CIA levels 0+ 
through 2+, and a similar pattern for DLI-FSI comparisons (Table 18). A 
particularly large discrepancy is noted for DLI level 2+, which shows 
corresponding FSI scores ranging from 2 to 4+. 
Table 8 

Examinee Score Levels Assigned, by Agency 
(German) 

Legend: 00 = 0, 07 = 0+, 10 = 1, 17 = 1+, etc. 

Asterisks indicate a score higher than that of the other two agencies. 

Table 9 

Expectancy Table for DLI from CIA Scores 
(French) 

Cell entries show the percentage of examinees assigned given scores by DLI for each level assigned by CIA. 

Table 10 

Expectancy Table for FSI from CIA Scores 
(French) 

Cell entries show the percentage of examinees assigned given scores by FSI for each level assigned by CIA. 

Table 11 

Expectancy Table for CIA from DLI Scores 
(French) 

Cell entries show the percentage of examinees assigned given scores by CIA for each level assigned by DLI. 

Table 12 

Expectancy Table for FSI from DLI Scores 
(French) 

Cell entries show the percentage of examinees assigned given scores by FSI for each level assigned by DLI. 

Table 13 

Expectancy Table for CIA from FSI Scores 
(French) 

Cell entries show the percentage of examinees assigned given scores by CIA for each level assigned by FSI. 

Table 14 

Expectancy Table for DLI from FSI Scores 
(French) 

Cell entries show the percentage of examinees assigned given scores by DLI for each level assigned by FSI. 

Table 15 

Expectancy Table for DLI from CIA Scores 
(German) 

Cell entries show the percentage of examinees assigned given scores by DLI for each level assigned by CIA. 

Table 16 

Expectancy Table for FSI from CIA Scores 
(German) 

Cell entries show the percentage of examinees assigned given scores by FSI for each level assigned by CIA. 

Table 17 

Expectancy Table for CIA from DLI Scores 
(German) 

Cell entries show the percentage of examinees assigned given scores by CIA for each level assigned by DLI. 

Table 18 

Expectancy Table for FSI from DLI Scores 
(German) 

Cell entries show the percentage of examinees assigned given scores by FSI for each level assigned by DLI. 

Table 19 

Expectancy Table for CIA from FSI Scores 
(German) 

Table 20 

Expectancy Table for DLI from FSI Scores 
(German) 

Cell entries show the percentage of examinees assigned given scores by DLI for each level assigned by FSI. 

Tables 13-14 (for French) and Tables 19-20 (for German) show the variation in 
scores observed for the CIA and DLI interviews for given score levels on the 
FSI-conducted interviews. These data may be "read" in the same manner as those 
shown in the other expectancy tables. For example, as shown in Table 13, of 
the 12 French interviewees assigned a rating of 2 by the FSI testers, 25 
percent received a score of 1 in the interviews conducted by CIA; 50 percent 
received a rating of 1+; and 25 percent, a rating of 2. 

Three major considerations should be kept in mind in evaluating the 
observed results. First, at issue in this study is the test-retest reliability 
of the interviewing process, in which the intent is to determine the extent to 
which given examinees, undergoing separate, independent interviews by each of 
the three agencies, will be assigned similar level scores in each instance. 
Observed variation in examinee score levels may be attributable (in proportions 
that it is not statistically possible to determine on the basis of the present 
study) to actual performance differences on the part of the examinee across the 
three interviewing occasions, as well as to agency-specific differences in the 
manner in which a given examinee performance would tend to be evaluated across 
the three agencies. It is, therefore, possible to suggest that at least some 
of the scoring differences observed in this study may be attributable to 
interview-to-interview variation in performance on the part of the examinees, 
rather than to rater unreliability per se. However, if the intended 
operational assumption is that the face-to-face interviewing technique 
(assuming good will and serious communicative effort on the part of the 
examinee and diligence and proper attention to elicitation procedures on the 
part of the examiners) should result in the awarding of similar ratings on 
closely contemporaneous interviewing occasions, the procedure used in this 
study may be considered an appropriate empirical approach to determining the 
validity of this assumption, within the general linguistic and personnel 
parameters involved (a sampling of two interviewer pairs for two languages 
across the three participating agencies). 

Second, although the total number of interviews obtained in the study was 
as large as practicable within the financial and administrative constraints 
involved, and may be considered to provide a reasonably stable and accurate 
indication of the results that would be secured in a similar but larger study, 
some caution in interpretation and extrapolation should be exercised, 
especially in analyzing those expectancy table columns and associated data that 
are based on a relatively smaller number of interviews. 

Third, the expectancy table data should not be viewed as representing 
in any sense "true" level ratings on the vertical axis. These tables simply 
show the extent to which the agencies in question tended to vary in the 
frequencies with which they assigned a given level score to a particular 
examinee. Any determination of which, if any, of the ratings assigned should 
be considered to reflect the "true" proficiency level of the examinee is beyond 
the scope of this study and, indeed, represents a question for which 
statistical data per se are, at best, of very limited value. Although there is 
some indication that, for the two languages involved, the interview ratings 
assigned by CIA and DLI were for certain portions of the overall proficiency 
scale more similar to each other than they were to the corresponding ratings 
assigned by FSI, it cannot and should not be concluded from these results that 
the former ratings were found to be "correct" and the latter "incorrect" in 
any useful external or criterial sense of the term. 

Analysis of rating "factor" data. Some additional information, especially 
for the interviews conducted by the FSI testing teams, is available concerning 
the statistical interrelationships of the raters' scoring of various linguistic 
categories or "factors" that are generally considered to contribute 
collectively to overall proficiency as expressed in the global rating, but at 
the same time to provide a certain amount of diagnostic feedback concerning 
particular subaspects of performance (within a given global level) exemplified 
by a given examinee. Table 21 shows, for combined French and German data, the 
observed intercorrelations of the FSI global rating and each of the five 
"factor" scores ("listening comprehension," "discourse," "structure," 
"lexicalization," and "fluency") regularly assigned by FSI testers for 
interviews conducted by that agency, as an aid in focusing on component aspects 
of the global rating and in providing for greater objectivity in the rating 
process overall. 

The observed high correlations may be considered attributable to the 
combined effects of at least two possible sources of correspondence: "true" 
close relationships among the factors as exemplified in the examinee's 
performance; and a potential "halo effect" arising from the fact that all 
factor scores are assigned by the same testers, who may be influenced to some 
extent by examinee performance on one or more of the other factors while 
attempting to objectively rate a given factor. Although the correlational data 
suggest that, on a total-group basis, relatively little additional information 
is provided by the individual factor scores that is not already statistically 
captured in the global rating, the scoring profiles of particular examinees 
whose pattern of factor ratings shows an appreciable departure from linearity 
may be of interest from a diagnostic or pedagogical standpoint. For example, 
the scatterplot of "structure" vs. "lexicalization" scores shown in Table 22 
shows three examinees whose factor ratings for "structure" were proportionately 
appreciably higher than their ratings for "lexicalization"; and three other 
examinees for whom the "lexicalization" scores were noticeably higher than the 
"structure" scores. Although detailed linguistic review of the interviewing 
performance of particular examinees is beyond the scope of the present report, 
the scoring data obtained in the study can serve to identify these and other 
"discrepant" cases for further clinical analyses addressing, for example, the 
so-called "street learner/school learner" performance differences frequently 
reported in operational testing activities. 

Detailed factor score data are not available for the CIA or DLI interviews 
in that, for the most part, testers from these two agencies followed the 
current operational procedure of providing only the overall global rating, with 
the single exception of a separate "listening comprehension" score that was 
consistently awarded by the CIA raters and in about two-thirds of the cases 
by the DLI raters. The obtained "listening" vs. "global" correlations were .97 
for the CIA interviews (N = 114) and .98 for the DLI interviews (N = 85). 
These data again suggest that, on a whole-group basis, very little "new" 
information is provided by the separate listening score. Analysis of 
individual discrepant cases for clinical or pedagogical purposes would of 
course be possible for the CIA and DLI data as well as for the FSI interviews. 
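
The two operations described above, correlating a factor score with another 
score and singling out discrepant cases, can be sketched in Python as follows. 
The scores are hypothetical values on the two-digit scale (10 = 1, 17 = 1+, 
and so on) and are not the study's data. The function names and the 
half-level threshold of 7 points are assumptions of this sketch, and a simple 
score difference is used here as a stand-in for "departure from linearity." 

```python
import math

def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def discrepant_cases(x, y, threshold):
    """Indices of examinees whose two scores differ by at least `threshold`."""
    return [i for i, (a, b) in enumerate(zip(x, y)) if abs(a - b) >= threshold]

# Hypothetical factor scores; NOT the study's data.
structure = [10, 17, 20, 27, 30, 37, 40]
lexicalization = [10, 17, 27, 20, 30, 37, 40]

r = pearson_r(structure, lexicalization)
flagged = discrepant_cases(structure, lexicalization, threshold=7)
print(round(r, 2), flagged)
```

A high overall correlation and a nonempty list of flagged cases can coexist, 
which is the situation the text describes for the FSI factor ratings. 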

Table 21 

Intercorrelations of FSI Factor Ratings 
(N = 115) 

Table 22 

FSI Structure vs. Lexicalization 
(r = .96; N = 115) 

Examinee and tester feedback on interviewing process. The observations 
and opinions of both examinees and interviewers concerning various aspects of 
the interviewing procedures as exemplified during the study were solicited 
through two separate questionnaires (Appendices A and B). The examinee 
questionnaire requested information on the examinee's affiliation and test 
language, and both "yes-no" and open-ended comment responses to the following 
questions for each of the three interviews taken: 

"Did the opportunities the testers provided you to speak the language 
during the [first, second, third] interview (in terms of the type and number of 
topics covered, range of performance required) adequately probe your maximum 
proficiency level?" 

"Did the testers during the [first, second, third] interview use any 
elicitation techniques or cover any kinds of topics that you thought were in 
any way 'unfair' or in some other way not a valid test of your speaking 
proficiency?" 

"During the [first, second, third] interview, did the testers appear to 
make a conscious effort to put you at ease?" 

Four additional summary questions involved forced-choice judgments as 
follows: 

"In which of the three interviews do you feel you... 

were most relaxed and at ease? 

were the most anxious or nervous? 

best demonstrated your optimum speaking proficiency? 

least well demonstrated your optimum speaking proficiency?" 

Questionnaires were distributed to the examinees by the testing 
coordinator at each agency within about one week following completion of the 
interviewing process, an approach intended to avoid the possibility that 
examinees filling out the questionnaire immediately on completion of the 
testing might be disproportionately influenced by their experience in the most 
recently taken interview. The examinee was asked to provide his or her name on 
an attached slip in order to properly categorize the "first," "second," and 
"third" interviews taken, but was assured that the slip would be removed when 
the results were summarized and that all data would be analyzed and reported on 
an anonymous basis. A self-addressed, postpaid envelope was provided for 
return of the questionnaire. Of the 115 examinees participating in the study, 
questionnaires were returned by 83, a response rate of 72 percent. 

Table 23 provides a summary of the examinee questionnaire responses. To 
the question "Did the opportunities the testers provided you to speak the 
language during the interview...adequately probe your maximum proficiency 
level?," by far the greatest number of responses (87 percent overall) were in 
the affirmative. Chi-square analysis for CIA, DLI, and FSI interviews showed 
no significant differences across agencies in the frequency with which the 
examinees reported an adequate probing of maximum proficiency level. To the 
question of elicitation techniques or coverage of topics that the examinee 
considered "'unfair' or in some other way not a valid test of your speaking 
proficiency," 82 percent of the total judgments across agencies were that 
unfair techniques had not been used. However, on an agency-specific basis, 
the corresponding chi-square is highly significant (p = .007), with 29 percent 
of the FSI interviews being judged as "unfair" in procedure or topical 
coverage, as contrasted to 13 percent and 12 percent for the CIA and DLI 
interviews, respectively. 
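
The chi-square comparison for the "unfair" question can be sketched in Python 
as follows. The "yes" counts of 11, 10, and 24 are an assumption of this 
sketch: they are the whole numbers closest to the reported 13, 12, and 29 
percent of 83 responses per agency, not figures printed in the report. The 
function names are likewise illustrative. With these assumed counts the 
computed statistic agrees with the reported chi-square of 9.93 (p = .007). 

```python
import math

def chi_square_2xk(yes_counts, totals):
    """Pearson chi-square for a yes/no by k-group contingency table."""
    grand_total = sum(totals)
    p_yes = sum(yes_counts) / grand_total  # pooled proportion answering "yes"
    chi2 = 0.0
    for yes, total in zip(yes_counts, totals):
        no = total - yes
        expected_yes = total * p_yes
        expected_no = total * (1 - p_yes)
        chi2 += (yes - expected_yes) ** 2 / expected_yes
        chi2 += (no - expected_no) ** 2 / expected_no
    return chi2

def p_value_df2(chi2):
    """Upper-tail probability for 2 degrees of freedom, which is exp(-x/2)."""
    return math.exp(-chi2 / 2)

# Assumed counts reconstructed from the reported percentages (CIA, DLI, FSI).
yes_counts = [11, 10, 24]
totals = [83, 83, 83]

chi2 = chi_square_2xk(yes_counts, totals)
p = p_value_df2(chi2)
print(round(chi2, 2), round(p, 3))
```

The closed-form p-value applies only because three agencies and two response 
categories give (3 - 1) x (2 - 1) = 2 degrees of freedom; other table shapes 
would need a general chi-square distribution function. 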

Table 23 

Summary of Responses to Examinee Questionnaire 

"Did the opportunities the testers provided you to speak the language during 
the interview (in terms of the type and number of topics covered, range of 
performance required) adequately probe your maximum proficiency level?" 

                      CIA         DLI         FSI 
                   Interview   Interview   Interview    Total 

YES                   90%         86%         85%        87% 
NO                    10%         14%         15%        13% 

Total responses:      72          72          75         219 

Chi square = .92; p = .63 

"Did the testers use any elicitation techniques or cover any kinds of topics 
that you thought were in any way 'unfair' or in some other way not a valid test 
of your speaking proficiency?" 

                      CIA         DLI         FSI       Total 

YES                   13%         12%         29%        18% 
NO                    87%         88%         71%        82% 

Total responses:      83          83          83         249 

Chi square = 9.93; p = .007 

"Did the testers appear to make a conscious effort to put you at ease?" 

                      CIA         DLI         FSI       Total 

YES                   89%         95%         82%        89% 
NO                    11%          5%         18%        11% 

Total responses:      83          83          82         248 

Chi square = 7.50; p = .024 

"In which of the three interviews do you feel you were most relaxed and at 
ease?" 

                      CIA         DLI         FSI 

                      27          40          12 

"In which of the interviews do you feel you were the most anxious or 
nervous?" 

                      CIA         DLI         FSI 

                      15           9          50 

"In which of the three interviews do you feel you best demonstrated your 
optimum speaking proficiency?" 

                      CIA         DLI         FSI 

                      23          29          25 

"In which of the three interviews do you feel you least well demonstrated your 
optimum speaking proficiency?" 

                      CIA         DLI         FSI 

                      22          20          27 

The question "Did the testers appear to make a conscious effort to put you 
at ease?" was answered affirmatively in almost 9 out of 10 cases overall (89 
percent), but chi-square analysis again shows a significant across-agency 
difference (p = .024), with a somewhat smaller proportion of the FSI interviews 
(82 percent) being judged as consciously directed toward putting the examinee 
at ease, by comparison to the corresponding CIA (89 percent) and DLI (95 
percent) figures. 

Although the total number of data elements for the forced-choice questions 
(one rather than three per examinee) is insufficient for across-agency 
statistical comparison, the absolute frequencies of response to these questions 
appear to corroborate rather closely the results of the earlier questions. To 
the question, "In which of the three interviews do you feel you were most 
relaxed and at ease?," 40 interviewees indicated "DLI"; 27, "CIA"; and 12, 
"FSI." The conversely phrased question, "In which of the three interviews do 
you feel you were the most anxious or nervous?," showed even greater 
differentiation across agencies, with only 9 interviewees identifying "DLI"; 
15, "CIA"; and 50, "FSI." Notwithstanding an apparent clear discrimination on 
the examinees' part as to the relative ease/anxiety producing qualities of the 
interview as conducted by each of the three agencies, no appreciable 
across-agency differences are shown in their judgments of the agency providing 
the best or worst opportunity to demonstrate their optimum speaking proficiency. 

To determine possible differences in questionnaire response tendencies 
attributable to an interaction between the agency affiliation of the examinees 
and that of the tester teams (that is, to investigate the possibility that, for 
example, DLI students might have reported different experiences or opinions 
concerning their participation in the DLI-conducted interviews than did 
interviewees from CIA or FSI being tested by the DLI team, or analogously for 
other examinee/agency combinations), additional chi-square analyses of each of 
the questions summarized in Table 23 were carried out for the crosstabulations 
of interviewee agency and tester agency. All of these analyses showed 
nonsignificant (p > .05) interaction effects, suggesting that reported examinee 
reactions to their experiences in being tested by each of the three agencies 
did not vary to any meaningful extent as a consequence of their own agency 
affiliation. These results must be considered only suggestive in view of the 
fact that, at all three agencies, a few of the examinees (particularly at the 
higher proficiency levels) were necessarily drawn from agency alumni or other 
sources. As such, their own reactions to the interviewing process may not have 
been fully typical of those of the current students; however, to exclude these 
non-student cases from the interaction analysis would have reduced the already 
small cell sizes to statistically inappropriate levels. 

The questionnaire completed by the examiners themselves (Appendix B) was 
somewhat less formal than the examinee questionnaire and requested open-ended 
comments by the testers concerning several aspects of their interviewing in the 
course of the project. Of the 24 testers participating in the study, 18 
returned completed questionnaires (75 percent). Responses were on an 
intentionally anonymous basis, with only the tester's "language and agency 
affiliation" being requested on the questionnaire form. As shown in Table 24, 
based on the project staff's categorizations for analysis purposes of the 
free-response answers, the great majority of testers felt that the interviewing 
procedures they had used during the study were the same as those used during 
"routine, day-to-day testing" at their agency, and that the ratings which they 
assigned were, on the whole, as accurate as those typically made during 
operational testing. To the question, "Do you feel that the accuracy of your 
ratings varied at certain times or points during the six-day testing period?," 
most respondents were of the opinion that their judging accuracy had not varied 
appreciably over the course of the testing, but some cited the relatively 
intensive testing schedule (involving in some cases up to six interviews per 
day) as a potential source of end-of-day fatigue and consequent lack of full 
and "fresh" attention to the interviewing and rating tasks. With respect to 
the possible effects of "examiner fatigue" on the overall study results, it 
should be emphasized that the counterbalanced scheduling of the interviewing 
sessions was designed to adjust operationally for this and other possible 
sequence-of-interviews-related factors insofar as the inter-agency comparisons 
at issue in the study are concerned. 

Table 24 

Summary of Responses to Tester Questionnaire 

"Do you feel that the interviewing procedures (elicitation techniques, use of 
props, role-plays, etc.) you used during the study were the same as those you 
use in routine, day-to-day testing?" 

SOMEWHAT             2 
DEFINITELY          16 

"Do you feel that the ratings you assigned during the study were, on the whole, 
more accurate, about as accurate, or less accurate than ratings you typically 
make in routine testing at your agency?" 

NOT AS ACCURATE      1 
ABOUT AS ACCURATE   17 

"Do you feel that the accuracy of your ratings varied at certain times or 
points during the six-day testing period?" 

NOT AT ALL           9 
A LITTLE             6 
SOMEWHAT             3 

"Did you notice any differences in the composition of the examinee groups at 
the different agencies in respect to overall levels of proficiency, examinee 
reactions to interview techniques, etc.?" 

NOT AT ALL           4 
A LITTLE             4 
SOMEWHAT            10 

"Do you feel that participation in the project was in any way interesting or 
beneficial to you?" 

NOT AT ALL           1 
SOMEWHAT             4 
DEFINITELY          13 

Some differences in the overall composition of the examinee groups at the 
three different agencies were also noted by the testers, with the FSI and CIA 
examinees, in general, considered to be more proficient on a total-group basis 
than the DLI interviewees. Again, the balanced nature of the study design, in 
which testers from each agency interviewed the same examinees at all three 
testing locations, would be expected to rule out any effects of inter-agency 
differences in examinee populations with respect to the project results per se. 

Despite the fairly rigorous testing schedule, which involved both 
concentrated interviewing on a day-to-day basis and travel between Washington 
and Monterey within a relatively brief time span, the great majority of 
interviewers felt that their participation in the project had been of interest 
and benefit to them. Cited especially in this regard were the opportunities to 
meet and interact with testers from other agencies and to "share notes" on both 
a personal and professional basis. Several examiners expressed the hope that 
similar projects undertaken in the future could have built into them more 
extensive and more formally structured opportunities for this type of 
interaction. 



SUMMARY 

The major results of the study may be summarized as follows. With respect 
to the testing of French and German by trained CIA, DLI, and FSI interviewer/ 
raters, as represented by two randomly selected two-person teams for each 
agency and language, who interviewed and rated a total of 20 examinees each 
across essentially the full spectrum of proficiency levels, the ratings 
assigned did not differ across agencies in a statistically significant way, 
either on a combined (French plus German) or individual-language basis. 
Notwithstanding these overall results, examination of the rating performance 
for various sub-portions of the proficiency scale showed fairly clear 
across-agency differences for both languages, primarily at the lower and middle 
ranges of the scale, with these differences for the most part reflecting 
relatively higher rating assignments on the part of the FSI raters by 
comparison to the ratings given by the other two agencies. As shown both in the 
distributions of test scores for the same examinees across agencies and in a 
series of two-way expectancy tables derived from these distributions, there are 
occasional fairly wide discrepancies in scoring for individual examinees, which 
suggests the advisability, on a follow-up basis, of clinically studying the 
most discrepant cases from both linguistic and interviewing-procedure 
standpoints, to attempt to identify common factors that may have contributed to 
these scoring differences. 

Analysis of the intercorrelations of the FSI "factor" scores among 
themselves and with the global ratings shows very high correspondence among all 
of these variables. Correlations of the CIA and DLI "listening" scores with 
the global ratings were also extremely high. These results suggest that, 
notwithstanding the possible utility of the factor scoring process in 
facilitating the interviewers' overall rating task, relatively little new or 
different statistical information is provided by the factor scores by 
comparison to the information already contained in the global ratings. 
However, factor score analysis does make it possible to identify individual 
examinees showing atypical (non-linear) factor score patterns, and detailed 
linguistic analysis of the interview performance of individuals showing such 
patterns may be of both research and pedagogical interest. 

Questionnaire-based information obtained from the participating examinees 
indicates that, for the most part, the examinees felt that their optimum level 
of proficiency had been adequately probed in interviews conducted by all three 
agencies. There were, however, appreciable differences in the examinees' 
affective reactions to the interviewing process, with a statistically 
significant tendency for the examinees to view the FSI interviewing 
procedure as both more anxiety-producing and making more frequent use of what 
they considered to be "unfair" elicitation techniques. Also on the basis of 
questionnaire responses, the great majority of participating testers reported 
that, in their opinion, the interviews which they had conducted during the 
study were quite similar to the operational interviews given at their home 
agency with respect to interviewing procedures and accuracy of ratings, 
although the atypically long testing day was cited in some instances as a 
potential source of differences in both areas. Virtually all testers found 
their own involvement in the study to have been quite rewarding to them from 
personal and/or professional standpoints. 

With regard to extrapolation of study results, it is reasonable to assume, 
as a consequence of the study design, that the testers chosen for the study 
represented a random sample of the population of testers currently interviewing 
in that language at each agency. As such, their performance may be considered 
indicative of the probable total group characteristics of testers in that 
language/agency combination, without, however, ruling out the possibility that 
"the luck of the draw" may have in some instances placed in the sample 
individuals having atypical characteristics in terms of their elicitation 
procedures or accuracy of rating vis-a-vis those of their colleagues. 

Considerable caution should be exercised in extrapolating the observed 
results for French and German testing to testing in other languages not 
formally investigated in the study, both in view of the fact that the 
non-studied languages have different populations of testers, and in 
consideration of possible linguistically-based differences across languages 
that would have an operational bearing on the interviewing process and/or on 
the reliability of the ratings assigned. It should also be emphasized that the 
present study provides information about the test-retest comparability of the 
interviewing process on an across-agencies basis, and does not directly examine 
the question of rating reliability within a given agency (i.e., the extent to 
which each of several raters within one agency would agree with one another in 
repetitive interviewing of a given examinee), and it is quite possible that the 
level of scoring agreement within any one agency would be greater than that 
on an inter-agency basis. However, to the extent that the ILR scale-based 
interview is intended to represent a "common metric" of examinee 
performance, with identical meaning and interpretation across using agencies, 
the results of the present study warrant close examination for possible 
conceptual or procedural implications that would arise from holding such an 
objective. 

APPENDIX A 

QUESTIONNAIRE FOR PARTICIPANTS IN INTERVIEW TESTING STUDY 

We would first like to take this opportunity to thank you for participating as 
an examinee in our study of proficiency testing and scoring procedures across 
three government language-teaching agencies. In order to derive the greatest 
possible amount of useful information from the study, we would very much 
appreciate it if you would take a few minutes to answer the questions below, 
based on your own experiences as an interviewee for this project. 

In order to properly categorize the "first," "second," and "third" interviews 
you took, we would ask you to indicate your name on the slip attached to the 
front of the questionnaire. This slip will be removed when the results are 
summarized, and all data will be analyzed and reported on an anonymous basis. 

A preaddressed, postpaid return envelope is enclosed for your convenience. In 
order for us to be able to prepare the final report on a timely basis, we 
would request that you return the completed questionnaire to us within one day 
of receipt if at all possible. 

Information concerning the proficiency level ratings that you were assigned 
during the study will be forwarded to you within approximately 5 days. 

Thank you again for your much-appreciated interest and participation in this 
important measurement study. 



Please answer each of the questions below by marking the correct space and/or 
by filling in a response as appropriate: 

(1) At which agency are you a student (or otherwise affiliated)? Check one: 

[ ] CIA 
[ ] DLI 
[ ] FSI 

(2) In which language were you tested? 

[ ] French 
[ ] German 

PLEASE ANSWER THE FOLLOWING QUESTIONS IN TERMS OF THE FIRST OF THE THREE 
INTERVIEWS YOU TOOK DURING THE STUDY. 

(3) Did the opportunities the testers provided you to speak the language during 
the FIRST interview (in terms of the type and number of topics covered, range 
of performance required) adequately probe your maximum proficiency level? 

[ ] Yes 

[ ] No 

[ ] Not Sure 

Comments? 

(4) Did the testers during the FIRST interview use any elicitation techniques 
or cover any kinds of topics that you thought were in any way "unfair" or in 
some other way not a valid test of your speaking proficiency? 

[ ] Yes 
[ ] No 

If "yes," please describe briefly: 



(5) During the FIRST interview, did the testers appear to make a conscious 
effort to put you at ease? 

[ ] Yes 
[ ] No 

Comments? 



PLEASE ANSWER THE FOLLOWING IN TERMS OF THE SECOND INTERVIEW YOU TOOK. 

(6) Did the opportunities the testers provided you to speak the language during 
the SECOND interview (in terms of the type and number of topics covered, range 
of performance required) adequately probe your maximum proficiency level? 

[ ] Yes 

[ ] No 

[ ] Not Sure 

Comments? 



(7) Did the testers during the SECOND interview use any elicitation techniques 
or cover any kinds of topics that you thought were "unfair" or in some other 
way not a valid test of your speaking proficiency? 

[ ] Yes 
[ ] No 

If "yes," please describe briefly: 



(8) During the SECOND interview, did the testers appear to make a conscious 
effort to put you at ease (regardless of whether it "worked")? 

[ ] Yes 
[ ] No 

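Comments? 

(9) Did the opportunities the testers provided you to speak the language during 
the THIRD interview (in terms of the type and number of topics covered, range 
of performance required) adequately probe your maximum proficiency level? 

[ ] Yes 

[ ] No 

[ ] Not Sure 

Comments? 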



(10) Did the testers during the THIRD interview use any elicitation techniques 
or cover any kinds of topics that you thought were in any way "unfair" or in 
some other way not a valid test of your speaking proficiency? 

[ ] Yes 
[ ] No 

If "yes," please describe briefly: 



(11) During the THIRD interview, did the testers appear to make a conscious 
effort to put you at ease? 

[ ] Yes 
[ ] No 



Comments? 




Y?" You were most relaxed and at 



Comments? 




Comments? 

(15) In which of the three interviews do you feel you least well demonstrated 
your optimum speaking proficiency? 



Comments? 



Please use the space below to give any additional information, comments, or 
suggestions concerning the interviewing procedures or other aspects of the 
study, or your performance on the interviews. Where necessary, please identify 
the interview(s) as FIRST, SECOND, etc. Thank you very much for your help! 

APPENDIX B 

INTERVIEW RATING COMPARABILITY STUDY 

EXAMINER FEEDBACK FORM 

We would like to take this opportunity to express our appreciation for 
your diligent and conscientious participation in the interview rating 
comparability study that will be completed with the third-agency testing at DLI 
today and tomorrow. Because of the quite busy schedule, which is necessitated 
for logistic reasons, it will not be possible for us to arrange for formal group 
discussions and information sharing concerning the interviewing process 
and other aspects of the study (even though some interaction has been possible 
on a more informal basis). In lieu of a formal feedback meeting as part of the 
"testing day" itself, we would greatly appreciate your taking the opportunity 
at some point over the next two days to respond to the questions below. In 
addition to answering the specific questions, we would appreciate any more 
general feedback or suggestions that you would care to provide concerning any 
aspect of the study. We would ask you not to give your name when filling out 
the questionnaire, but we would appreciate your marking your language and 
agency affiliation in the space provided at the end of the questionnaire. 

1. Do you feel that the interviewing procedures (elicitation techniques, use 
of props, role-plays, etc.) you used during the study were the same as 
those you use in routine, day-to-day testing? Please explain briefly. 



2. Do you feel that the ratings you assigned during the study were, on the 
whole, more accurate, about as accurate, or less accurate than the ratings 
you typically make in routine testing at your agency? Please explain. 



3. Do you feel that the accuracy of your ratings varied at certain times or 
points during the six-day testing period? 



4. Did you notice any differences in the composition of the examinee groups 
at the different agencies with respect to overall levels of proficiency, 
examinee reactions to interview techniques, etc.? 



5. Do you feel that participation in the project was in any way interesting 
or beneficial to you? 



6. If a similar or expanded study of rating comparability were to be conducted 
in the future, do you have any recommendations on additional factors that 
might be included in planning or carrying out the study? 



Your Language Agency 


